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Abstract 

We outline a procedure for jointly sampling substitution matrices and multiple 
sequence alignments, according to an approximate posterior distribution, using an 
MCMC-based algorithm. This procedure provides an efficient and simple method by 
which to generate alternative alignments according to their expected accuracy, and 
allows appropriate parameters for substitution matrices to be selected in an auto¬ 
mated fashion. In the cases considered here, the sampled alignments with the highest 
likelihood have an accuracy consistently higher than alignments generated using the 
standard BLOSUM62 matrix. 


Introduction 

Most commonly used sequence alignment programs (e.g. m) make use of a substitution 
matrix to specify the score associated with aligning different types of amino acids. Much 
work has been focused on the development of improved substitution matrices to improve 
the accuracy of the resulting alignments, and various algorithms have been developed to 
this end. 

One approach is to take a set of reference alignments, and to derive parameters that gen¬ 
erate alignments that best match this reference set, either by matching the substitution 
parameters to observed statistics [IHS], or by varying parameters in order to maximise the 
alignment accuracy with respect to the reference set [MS]. An alternative approach is 
to iteratively align the set of sequences, at each iteration deriving a new matrix from the 
observed pair frequencies within the aligned dataset [l4] . 

While the early substitution matrices consisted of a small set of alternative matrices tailored 
for sequences of differing sequence identity later approaches allowed for these matrices 
to be adjusted to account for the amino acid content in a specific set of sequences m, as well 
as allowing for multiple classes of substitution matrix at different sites in the sequence [Si- 
Further improvements in accuracy have been obtained by accounting for of the primary HZ], 
secondary [T^IITM^ and tertiary structural context of each residue, or by using 

matrices tailored for specific types of proteins [24l[25] . 
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Impact of substitution matrix on alignment uncertainty 

An issue arising with all of the above approaches is that once the substitution matrix is 
selected, it is typically regarded as fixed for the purposes of subsequent analyses. Some 
algorithms include steps to select different substitution matrices at various stages of a pro¬ 
gressive alignment j3], but these matrices are sampled from a fixed initial set. 

Since the resulting alignments will often be sensitive to the choice of matrix parameters, 
especially for more divergent sequences [3], the choice of substitution matrix may lead 
to significant bias in the resulting alignments. In some cases even small changes to the 
procedures used to derive optimal matrices may make a noticeable difference to the resulting 
alignment accuracy |26j . 

In addition, although procedures exist for adapting substitution matrices to sequences of 
interest [HIISIIIT], the optimisation of the matrix parameters is generally carried out on 
the reference alignments rather than the sequences under analysis; unless explicitly ac¬ 
counted for, any differences in composition between datasets may be a further source of 
unpredictability. 

Although several procedures have been devised for assessing the reliability of alignments (e.g. 
P51 - I5D] 1. these procedures do not account for the uncertainty in the choice of substitution 
matrix, such that the reliability in the alignment will usually be overestimated. 


Joint sampling approaches 

One way to address this problem is to simultaneously sample substitution matrices and 
alignments from a joint posterior distribution, thereby incorporating parameter uncertainty 
into the alignment estimation. The approach of Zhu et al. |31j focuses on pairwise align¬ 
ments, and defines the space of possible substitution matrices as the BLOSUM or PAM 
series. In contrast, StatAlign is able to estimate parameters for arbitrary matrices once 
converted into rate matrix form, and does so while simultaneously sampling multiple se¬ 
quence alignments and phylogenetic trees |32j . However, this full joint sampling process is 
computationally intensive, limiting its application to larger numbers of sequences. 

Here we explore an intermediate approach whereby substitution matrices are sampled from 
an approximate posterior distribution, and a single optimal alignment generated for each 
sampled matrix. This approach allows the effect of parameter uncertainty to be propagated 
into the alignment inference, while retaining much of the tractability of commonly used 
alignment algorithms. 


Methods 


Substitution matrices are frequently defined in terms of log odds scores for pairwise homology 
statements. In the case of PAM matrices, these log odds scores are derived via conditional 
probabilities of one amino acid mutating into another within a particular time interval [3]. 
BLOSUM matrices [5] are similarly based on the joint probabilities of observing a particular 
amino acid pairing in a set of reference alignments, with matrix entries defined as 


Ma,b = 


2 ^ p(a o b) 


( 1 ) 
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where aob denotes a homology statement between characters a and b, and p{a) denotes the 
background probability of character a. Rearranging and exponentiating, we obtain 

p(a ob) = p{a)p{b) exp {XMa,b} (2) 

with A defined such that J2abPi^ o b) = 1. As discussed by Yu et al. [TS], for a given 
substitution matrix, M, the matrix of pair probabilities can be uniquely recovered via the 
relationship P = Y~^, where Yij = 2^^'^ /2. 

While the pairing probabilities, p{a o 6), are usually taken as fixed throughout the analysis, 
we proceed by conducting approximate posterior inference on these parameters. 

Prior probability of substitution matrices 

We use the pair frequencies from the BLOCKS database [33] in order to construct an 
informative prior for the substitution matrix. To do so, we place the the following prior 
on the pair frequencies, centred around the observed frequencies used in the BLOSUM62 
matrix 

f,j - Gamma(/,j/cr^ (3) 

which has mean fij and variance a^. The pairing probabilities are then derived as 

Piioj) = (4) 

One could in principle sample cr^ from its posterior using MCMC, but in the current applica¬ 
tion we fix cr = 10^ (for comparison the standard deviation between the empirical frequencies 
is 6656.3), leading to a prior that moderately constrains the pair probabilities around the 
values in the original BLOSUM62 matrix, but allows for some significant variability, as 
shown in Figure [TJ 
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Figure 1 - The prior distribution on pair frequencies, shown for each amino acid pairing, as defined in equation m- Each block of boxes (shown in the same colour, with boundaries 
highlighted by the alternating grey/white background) corresponds to pairings with the amino acid shown on the x-axis. Within each block, the ordering of the second letter of the pair 
follows the ordering of the first letters on the axis label. In each case the mean is centred around the observed frequencies taken from the counts in the BLOCKS database used to derive the 
BLOSUM62 matrix. Self pairings (corresponding to the right-most box in each group) typically show a much higher frequency, reflecting the level of sequence conservation in the database. 
Due to the symmetry of the matrix, only the upper triangle and diagonal elements are shown, such that moving from right to left across the plot involves progressively removing the first 
element of the group (for example, the first box in the ’V’ block corresponds to a ’VA’ amino acid pairing, whereas the first box in the ’V’ block corresponds to a ’YN’ pairing, and the first 
box in the ’W’ block to a ’WD’ pairing). 



Likelihood 


A sum-of-pairs alignment objective score is equivalent to the log likelihood under a Markov 
random field model, with independence between alignment columns. The overall likelihood 
of a set of sequences, S, given an alignment. A, and parameters, 0, can be written as 


L K 


K-l K 


p(S'I A,0) oclogJ|]^p(Aj, I 0) 


p{AkioAii I 0) 


p{Am I Q)p{Au I 0) 

(-1 K 

iogp{s\A,e) = EE logp{Aji I 0) +EE E XMAki.Aii +const. 


L K 


L K-l K 


2=1 


sequence log likelihood 


Z=fe+1 


alignment score 


(5) 

( 6 ) 


where L is the number of columns in the alignment, K is the number of sequences, and Aki 
represents either the character from sequence k that is aligned to the ith column, or a gap 
character. The marginal probabilities p{Aki \ 0) can also be modified to incorporate an 
affine gap penalty, such that 


p{Aki = -) 


g if Ak^t-i = - 

h if .T/j; 2—1 ^ 


(7) 


with g < h. 


Approximate marginal likelihood 

As discussed earlier, full posterior sampling of multiple alignments is computationally in¬ 
tensive, such that joint estimation of substitution matrices and alignments is currently 
impractical on datasets with larger numbers of sequences. 

In order to increase computational efficiency, we adopt an approximate procedure whereby 
the marginal likelihood of the sequences given a set of parameters 0 is assumed to be equal 
to the value of the likelihood corresponding to the alignment with the optimal score 

p{S I 0) « supp{S I A, 0) (8) 

A 

This approximation effectively asserts that the contribution of suboptimal alignments to the 
posterior is negligible, or, equivalently, that the prior on alignments for a particular 0 is 
a point mass at the maximum likelihood alignment. Under this approximation, alignment 
uncertainty is determined by the variance in 0. 


Sampling according to expected alignment accuracy 

The search for alignments that maximise the target score is predicated on the assumption 
that the score is positively correlated with alignment accuracy for a given set of parameters. 
For pairwise alignments, the approximate distribution of alignment scores can be derived 
under various assumptions [34]. However, for multiple sequences the objective scores used 
within the alignment programs may not carry sufficient information to predict alignment re¬ 
liability |35] . As such, many alternative metrics have been developed for assessing alignment 
quality [361139] . 
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In this study, we consider a measure of alignment quality that is based on the number of 
non-gap homology statements in the alignment 


q{A)=Y^{K-gM)){K-gM)-l)l2 (9) 


where K is the number of sequences, and gi{A) denotes the number of gaps in column i in 
alignment A. 

As discussed in the Results section, empirically we observe a strong correlation between 
q{A) and the alignment accuracy. For a random set of alignments, this quantity will not in 
general be correlated with the alignment accuracy, since it does not account for the sequence 
content. However, for optimal-scoring alignments generated according to equation (jS]), any 
predicted homology statements must show significant evidence of being non-random in order 
to be included in the alignment, hence an alignment containing more homology statements 
should have a higher expected accuracy. 

Alternative likelihood function 

Given the empirically observed correlation between q{A) and the alignment accuracy, we 
opt to sample substitution matrices according a modified marginal log likelihood of the form 

logp(S' I 0) = ;^g(i[0]) (10) 

where A[0] = supj^p{S \ A, 0) is the optimal-scoring alignment under the original likelihood 
in equation and t(S) is a measure of the variability in the alignment of the set of 
sequences S. 

The quantity q{A) can be roughly approximated as a sum of N independent squared binomial 
variables with n = K and p = x/if, where x is the expected number of non-gap characters 
per column. Denoting the average sequence length by L, then we must have = ATL, 
such that N = L/p. Using the delta method, the approximate variance of each of these 
variables will be 0{x^{\ —p))- Hence, ignoring variability between the number of characters 
in each column, to a first approximation the variance of q{A) for random alignments will be 
of the order of (1 — p)x^N = PK^L, where /3 = (1 — p)p'^, with a maximum value of 0.148, 
when p = 2/3. With a uniform prior on /3, the posterior mean is approximately 0.08. 

We therefore set t to be the following function of the number of sequences, K, and the 
average sequence length, L: 

t{S) = (11) 

In principle one could sample P from its posterior distribution, but the estimation of this 
quantity will be affected significantly by considering only a single alignment for each substi¬ 
tution matrix. In our analyses we fixed the value to 0.025, below the approximate expected 
value for random alignments, but large enough to allow efficient traversal between modes of 
the likelihood. 

Alignment accuracy 

To measure the accuracy of generated alignments, we opt for the alignment metric accuracy 
(AMA) score introduced by Schwartz et al. [ID] , since this possesses no inherent bias towards 
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long or short alignments. To define the AM A score metric, we first define the following sets 


'Hh{A) = {{cj,Ck) I • {Aij ^ -), {Aik ^ -)} 
'Hd{A) = {(cj,0) I 3i • {Aij ^ -), {Aik = -)} 
'^i{A) = {(0, Ck) I 3i ■ {Aij = -), {Aik ^ -)} 

'Hn{A) ='Hd{A) LI'Hi{A) 


pairwise homology statements 
pairwise deletions 
pairwise insertions 
pairwise non-homology statements 


where Cj S {1,..., \Aj\}, and \Aj \ is the length of the jth sequence. With these definitions, 
the accuracy of a predicted alignment, P, relative to the true alignment, T, is given by 


a{P,T)=a{T,P) = 


2|Hg(p) nHg(r)| + \nN{P)n'HN{T)\ 

{K-l)Ek\Ak\ 


( 12 ) 


Varying the effective gap penalty 

Allowing the overall magnitude of the substitution matrix to vary is analogous to varying the 
gap parameters. Hence, by allowing the A parameter to vary, it is possible to simultaneously 
explore alternative substitution matrices and gap parameters, assuming a fixed ratio between 
the gap opening and extension penalties. 

However, using a likelihood of the form in equation (|10p . caution must be exercised when 
allowing A to vary, since large values of A will effectively render gaps exceedingly unlikely, 
causing the sequences to be over-aligned. As an example of this. Figure [5] shows the posterior 
relationship between A and alignment accuracy when A is allowed to vary freely (i.e. with 
an uninformative prior), with gap parameters g = —10, h = —1. Up to a certain point 
(roughly A = 1.5), increasing A is associated with increased alignment accuracy, but beyond 
this the accuracy deteriorates rapidly, although the number of homology statements in the 
alignment continues to increase. 

A similar pattern was observed in several other datasets (not shown), hence we opted to use 
an informative prior for A centred around unity, designed to keep A approximately within 
the range [0.5,1.5]. 


p(A) = N(logA|0,0.1) 


(13) 


MCMC scheme 

The general MCMC scheme we adopt consists of proposing a new set of pair frequencies, /*, 
computing the corresponding log-odds substitution matrix, M*, recomputing the optimal 
alignment for the new substitution matrix under the scoring scheme in equation ([6]), and 
then accepting or rejecting the new matrix based on the equation (Hnn.i.e. accepting when 


logC/(0,1) < log 


p{S\M*,g,h) 
P{S I M,g,h) 


+ log 


p(r 1/,^^) 

Pif I 


where p( • ) is as defined in equation (ITU)) . 


(14) 


In order to improve mixing on the space of substitution matrices, we used two types of 
proposals. These involve taking subsets of matrix entries of size n, where n = 50, 25, and 
adding independent U[—pi,pi] noise to the corresponding frequencies, where i = \ indicates 
n = 50, and i = 2 indicates n = 25. By modifying submatrices of this size, on average 
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Figure 2 - When X is not constrained by an informative prior, it can become very large, 
since this increases the number of homology statements, thereby increasing the likelihood in 
equation m- However, beyond \ = 1 . 5 , this leads to a decrease in alignment accuracy, due 
to over-alignment of the sequences. 


between 90% and 95% of proposed changes to the substitution matrix lead to changes to 
the optimal alignment. 

In order to preserve the symmetry of the matrix, only the upper triangle and diagonal 
elements are modified, with the lower triangle updated accordingly afterwards. Since we 
are using symmetric proposal kernels, no adjustment to the Metropolis-Hastings ratio is 
required. Moves that lead to negative frequencies result in zero density under the prior, and 
hence are rejected. 

The scaling factor, A, is also sampled using a uniform random walk, characterised by a 
parameter p^. 

The perturbation factors, pi, are initialised at 70, 10 and 0.5, and automatically tuned 
during the burn-in period, according to the following procedure: Every 10 iterations, the 
acceptance rate for each move type is queried, and if it does not fall within the specified 
range (set by default to [0.2, 0.4] as per the considerations outlined by [41]), the parameter 
Pi is multiplied or divided by a factor of 0.9, and the acceptance counts reset to zero. The 
different moves are selected according to weights wi = 1 ,W 2 = 0.2 and W 3 = 0.5. 


Implementation details 

In general, finding the alignment that maximises a score of the type in equation (jS]) is NP 
hard [42] . Multiple alignment programs generally make use of heuristic procedures such as 
progressive alignment in order to approximate this optimum. 

In the current implementation of our iterative sampling scheme, we use the program MUS¬ 
CLE [5] in order to generate the (approximately) optimal alignment for each substitution 
matrix. 
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Figure 3 - Time taken to generate 4500 alignment samples under the iterative realignment 
MCMC algorithm, using MUSCLE to realign the sequences, on a single Intel iS 2.4GHz core. 
A significant portion of the runtime involves reading and writing to disk; the runtime could 
therefore be significantly reduced by integrating the perturbation procedure with the alignment 
code. 


Each sampled substitution matrix is written to file, and then an instance of MUSCLE is 
run using this substitution matrix, with the following command 

muscle -gapopen g -gapextend h -matrix M -in seqs.fasta -out ali.fasta 

In order to increase sampling speed, we also used the flag -maxiters 2, which restricts 
the number of refinement steps carried out by MUSCLE. Sampling 3000 alignments after a 
burn-in of 1500 iterations (4500 iterations overall) requires approximately 4.5 minutes for a 
set of 15 sequences, 6 minutes for 33 sequences, 9 minutes for 60 sequences, and 16 minutes 
for 122 sequences, increasing by a factor of approximately 1.5 as the number of sequences is 
doubled. Significant improvements to the runtime could be obtained by incorporating the 
perturbation procedure directly into the MUSCLE code, such that intermediate read/write 
operations can be omitted from the workflow. 


Results 

To evaluate the methodology, we conducted analyses on four datasets taken from the 
OXBench database [33]. To generate these datasets, we selected one of the largest align¬ 
ments from the OXBench suite [118], consisting of 122 immunoglobulin sequences, with 
average length 113. To assess how the alignment sampling method scaled with the number 
of sequences after controlling for other factors (such as amino acid content and sequence 
length), we subsampled smaller datasets from this alignment, yielding datasets with 15, 33, 
60 and 122 sequences. These subsets were sampled so as to maximise dissimilarity within 
the subset, since the original alignment contained several well-defined subgroups that would 
otherwise skew the analysis. 
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Figure 4 ~ Trace plot of the number of homology pairs in the alignment over the course of 
4500 MCMC iterations, with a burn-in period of 1500 at the start, for K = 15, 33, 60,122 (left 
to right). This quantity plays the role of a log-likelihood in these simulations (after weighting 
by the factor t). 


# seqs 

(log likelihood) 

(A) 

(# pairs) 

15 

0.22 

0.23 

0.47 

33 

0.21 

0.36 

0.59 

60 

0.30 

0.20 

0.58 

122 

0.16 

-0.01 

0.58 


Table 1 - While the log likelihood of the optimal alignment has a weak positive correlation 
with the alignment accuracy, the number of pairwise homology statements is more strongly 
correlated with the accuracy (linear correlation coefficients shown in bold). The scaling factor 
A is also only weakly correlated with the accuracy. The log likelihood and number of pairs have 
a very low correlation with each other (R^ = 0.07,0.18, —0.01,0.01 respectively). 


MCMC sampling was carried out for 4500 iterations, with the initial 1500 as burn-in. As 
shown by the trace plots in Figures HHHl the key summary statistics appear to have success¬ 
fully converged by the end of the burn-in. 

The log likelihood shown in Figure [S] (corresponding to equation ([S])) decreases during 
the burn-in. This quantity is not strongly correlated with the alignment accuracy (see 
Figure Table [7]) , and is uncorrelated with the alternative likelihood used when accept¬ 
ing or rejecting the proposals according to equation (fTTl) {B? = 0.07,0.18,-0.01,0.01 for 
K = 15,33,60,122). Similarly, the multiplier A is also only weakly correlated with the 
alignment accuracy (see Tahle\^. 

In contrast, the number of homology pairs is strongly correlated with the alignment accuracy 
(see Figure\^ Table\^, justifying the use of the alternative likelihood in equation (fTU|) . 

It is also notable that a significant proportion of the sampled alignments have accuracy 
higher than the alignment generated using the original BLOSUM62 matrix (shown in red 
in Figures [7][5] and Figure [S]). 

On average, the sampled alignments are similar in length to the true alignments shorter 
(posterior mean lengths of 140, 149, 150 and 153, compared to the corresponding bench¬ 
mark OXBench alignment lengths of 144, 150, 152 and 157 for K = 15,33,60,122 respec¬ 
tively), while the alignments generated with the original BLOSUM62 matrix deviate more 
significantly in length when compared to the benchmarks (lengths 141, 141, 146 and 150 
respectively). 
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Figure 5 - Trace plot of the original log likelihood in equation w over the course of 4500 
MCMC iterations, with a bum-in period of 1500 at the start, for K = 15,33,60,122 (left to 
right). In all cases, this quantity decreases substantially during the burn-in, since it is typically 
uncorrelated with the overall number of homology pairs in the alignment. 






Figure 6 - Trace plot of the A parameter that acts as an inverse multiplier on all the entries 
of the substitution matrix, for K = 15, 33, 60,122 (left to right). 


K = 15 


K = 33 


K = 60 



Log likelihood Log likelihood Log likelihood Log likelihood 

Figure 7 " Although there is some correlation between log likelihood and alignment accuracy, 
it is generally weak. The alignment generated using the original BLOSUM62 matrix is shown 
in red. 
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K = 15 K = 33 K = 60 K = 122 
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Pairwise homology statements Pairwise homology statements Pairwise homology statements Painwise homology statements 


Figure 8 - The number of homology statements in the alignment is strongly correlated with 
the alignment accuracy. Although this will not be true in general for randomly generated align¬ 
ments, for optimal alignments generated by a program sueh as MUSCLE, predicted homology 
statements have a higher probability of being accurate, hence larger numbers of homology state¬ 
ments is correlated with overall accuracy, justifying the form of the alternative likelihood in 
equation m- The alignment generated using the original BLOSUM62 matrix is shown in red. 
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Figure 9 - The distribution of alignment accuracy scores typically includes a large number of 
alignments with accuracy greater than that of the alignment derived from the initial BLOSUM62 
matrix. As more sequences are added to the dataset, the average accuracy increases, due to 
additional information contained in the dataset. The variability remains roughly constant due 
to the dependency of r on K in equation m- 
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Figure 10 - A comparison of the matrix entries in the original BLOSUM62 matrix (x-axis) 
with the entries of the matrix yielding the highest accuracy alignment on the QO-sequence 
dataset. While the distributions are centred around the initial values, there is significant vari¬ 
ability, particularly for the negative entries (corresponding to pairings with a low probability). 


Small perturbations can have a large effect on the resulting align¬ 
ment 

Although the perturbations to the substitution matrix are generally small, constrained fairly 
strongly by the prior, these small modifications to the substitution matrix can often make 
a large difference to the resulting alignment. As an example, for the 60-sequence OXBench 
dataset, while the original BLOSUM62 matrix in conjunction with MUSCLE yielded an 
alignment with AMA score of 0.73, the 95% highest posterior density interval spans the 
range [0.70,0.79], and the maximum accuracy yielded by one of the sampled matrices is 0.80. 
Figure (TU] shows the distribution of entries of the matrix yielding this maximum accuracy 
alignment, illustrating how the entries are centred closely on the values in the original 
BLOSUM62 matrix, but with higher variance for the more negative entries (corresponding 
to low pairing probabilities). 


Posterior gap multiplier 

As shown in Figure [51 the A multiplier parameter appears to converge and mix relatively 
successfully. Examining the posterior distribution of this quantity shows that it remains 
strongly constrained by the prior (see Figure [H, but shifted slightly upwards towards 
higher values of A. 
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Conclusions 


The method outlined here provides a simple approach for incorporating parameter uncer¬ 
tainty into score-based alignment, generating a set of alternative multiple alignments for a 
set of sequences, and sampling substitution matrices based upon an approximate likelihood 
that is a good predictor of the accuracy of the resulting alignments. 

In the examples considered here, varying the substitution matrix parameters can signifi¬ 
cantly affect alignment accuracy, and the high-likelihood alignments typically have a higher 
accuracy than the alignment generated using the standard BLOSUM substitution matrix. 

As well as providing a way to address parameter uncertainty, this approach provides a 
way of systematically generating alignments according to a probability distribution that is 
correlated with alignment accuracy. These sampled alignments may provide a useful starting 
point for assessing the affect of alignment uncertainty downstream analyses [44]. 

As discussed by Herman et al., the minimum-risk summary alignment derived from a set of 
posterior alignment samples typically is more accurate than the majority of the individual 
samples [44] . The minimum-risk summary procedure could be used in conjunction with the 
sampling procedure outlined here in order to generate more reliable alignments. 

More extensive tests of this procedure on alignment benchmark databases are required 
to determine whether this observed relationship between number of homology pairs and 
alignment accuracy holds in a more general context. 


Additional material 

BLOSUM matrices and pair frequencies were downloaded from 

ftp;//ftp.ncbi.nih.gov/repository/blocks/unix/blosum/blosum.tar.Z 

(last accesed 18 January, 2015). 
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